4.9. Uploading Efficiently Using Blocks
azbackup isn’t only about cryptography and security—it is also
about providing a good backup experience (after all, it has the word
backup in its name). The straightforward way to
back up encrypted data to the cloud is to initiate a “Create blob”
operation and start uploading data.
However, there are two downsides to doing things this way. First,
uploads are limited to 64 MB with a single request. Backups of huge
directories will often be larger than 64 MB. Second, making one long
request means not only that you’re not making use of all the bandwidth
available to you, but also that you’ll have to restart from the
beginning if the request fails.
The first order of business is to add support in storage.py for adding a block and committing a
block list. Example 10
shows the code to do this.
Example 10. Block support in storage.py
def put_block(self, container_name, blob_name, block_id, data):
    # Take a block id and construct a URL-safe, base64 version
    base64_blockid = base64.encodestring(str(block_id)).strip()
    urlencoded_base64_blockid = urllib.quote(base64_blockid)

    # Make a PUT request with the block data to blob URI followed by
    # ?comp=block&blockid=<blockid>
    return self._do_store_request("/" + container_name + "/" +
                                  blob_name +
                                  "?comp=block&blockid=" +
                                  urlencoded_base64_blockid,
                                  'PUT', {}, data)

def put_block_list(self, container_name, blob_name,
                   block_list, content_type):
    headers = {}
    if content_type is not None:
        headers["Content-Type"] = content_type

    # Begin XML content
    xml_request = "<?xml version=\"1.0\" encoding=\"utf-8\"?><BlockList>"

    # Concatenate block ids into block list
    for block_id in block_list:
        xml_request += "<Block>" + \
            base64.encodestring(str(block_id)).strip() + "</Block>"

    xml_request += "</BlockList>"

    # Make a PUT request to blob URI followed by ?comp=blocklist
    return self._do_store_request("/" + container_name +
                                  "/" + blob_name +
                                  "?comp=blocklist", 'PUT',
                                  headers, xml_request)
We covered the XML and URI formats in detail earlier in this
chapter. Since the XML being constructed here is trivial and uses a
well-defined character range, the code hand-constructs it instead of
using Python’s XML support.
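For reference, for a blob built from two blocks, the xml_request string that put_block_list sends with ?comp=blocklist would look roughly like the following. It is pretty-printed here for readability (the code emits it on one line), and the IDs shown are placeholders for the base64-encoded block IDs azbackup generates.

<?xml version="1.0" encoding="utf-8"?>
<BlockList>
  <Block>base64-block-id-1</Block>
  <Block>base64-block-id-2</Block>
</BlockList>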
With this support in place, azbackup can now chop up the encrypted archive
into small blocks, and call the previous two functions to upload them.
Instead of uploading the blocks in sequence, they can be uploaded in
parallel, speeding up the process. Example 11
shows the entire code.
Example 11. The azbackup block upload code
def upload_archive(data, filename, account, key):
    conn = storage.Storage("blob.core.windows.net", account, key)

    # Try and create container. Will harmlessly fail if already exists
    conn.create_container("enc", False)

    # Heuristics for blocks
    # We're pretty hardcoded at the moment. We don't bother using blocks
    # for files less than 4MB.
    if len(data) < 4 * 1024 * 1024:
        resp = conn.put_blob("enc", filename, data,
                             "application/octet-stream")
    else:
        resp = upload_archive_using_blocks(data, filename, conn)

    if not (resp.status >= 200 and resp.status < 400):
        # Error! No error handling at the moment
        print resp.status, resp.reason, resp.read()
        sys.exit(1)

def upload_archive_using_blocks(data, filename, conn):
    blocklist = []
    queue = Queue.Queue()

    if parallel_upload:
        # parallel_upload specifies whether blocks should be uploaded
        # in parallel and is set from the command line.
        for i in range(num_threads):
            t = task.ThreadTask(queue)
            t.setDaemon(True)  # Run even without workitems
            t.start()

    offset = 0

    # Block uploader function used in thread queue
    def block_uploader(connection, block_id_to_upload,
                       block_data_to_upload):
        resp = connection.put_block("enc", filename, block_id_to_upload,
                                    block_data_to_upload)
        if not (resp.status >= 200 and resp.status < 400):
            print resp.status, resp.reason, resp.read()
            sys.exit(1)  # Need retry logic on error

    while True:
        if offset >= len(data):
            break

        # Get size of next block. Process in 4MB chunks
        data_to_process = min(4 * 1024 * 1024, len(data) - offset)

        # Slice off next block. Generate an SHA-256 block id
        # In the future, we could use it to see whether a block
        # already exists to avoid re-uploading it
        block_data = data[offset: offset + data_to_process]
        block_id = hashlib.sha256(block_data).hexdigest()
        blocklist.append(block_id)

        if parallel_upload:
            # Add work item to the queue.
            queue.put([block_uploader, [conn, block_id, block_data]])
        else:
            block_uploader(conn, block_id, block_data)

        # Move offset forward
        offset += data_to_process

    # Wait for all block uploads to finish
    queue.join()

    # Now upload block list
    resp = conn.put_block_list("enc", filename,
                               blocklist, "application/octet-stream")
    return resp
The action kicks off in upload_archive. If the input data is less than 4
MB, the code makes one long sequential request. If it is greater than 4
MB, the code calls a helper function to split and upload the data into
blocks. These numbers are chosen somewhat arbitrarily. In a real
application, you should test on your target hardware and network to see
what sizes and block splits work best for you.
The upload_archive_using_blocks function takes
care of splitting the input data into 4 MB blocks (again, another
arbitrary size chosen after minimal testing). For block IDs, an SHA-256
hash of the data in the block is used. Though the code doesn’t support
it as of this writing, it would be easy to add a feature that checks
whether a block of data already exists in the cloud (using the SHA-256
hash and the Get Block List operation)
before uploading it.
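To sketch what that could look like, the helper below compares locally computed SHA-256 block IDs against the IDs already stored in the cloud. It assumes a hypothetical get_block_list method on the storage object, which storage.py does not include as of this writing, so treat it as illustrative only.

import hashlib

# Illustrative only: assumes storage.Storage gains a hypothetical
# get_block_list(container, blob) method returning the block IDs
# already uploaded for the blob. storage.py doesn't have this yet.
def blocks_needing_upload(conn, filename, data, block_size=4 * 1024 * 1024):
    existing_ids = set(conn.get_block_list("enc", filename))  # hypothetical
    pending = []
    offset = 0
    while offset < len(data):
        block_data = data[offset: offset + block_size]
        block_id = hashlib.sha256(block_data).hexdigest()
        if block_id not in existing_ids:
            pending.append((block_id, block_data))
        offset += block_size
    return pending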
Each block is added to a queue that a pool of threads processes.
Since Python doesn’t have a built-in thread pool implementation, a
simple one that lives in task.py is
included in the source code. (The source isn’t shown here, since it
isn’t directly relevant to this discussion.) It manages a set of
threads that read work items off a queue and process them. Tweaking the
number of threads for your specific environment is important for good
upload performance.
In this case, the “work item” is a function reference (the inner
function block_uploader) and a list of
arguments to that function. When a work item is
processed, block_uploader gets called
with the arguments contained in that list (a storage connection object,
a block ID, and the data associated with that block ID). block_uploader then calls put_block in the storage module to upload that
specific block.
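For readers curious about what task.py might contain, a minimal worker thread along the following lines would do the job. This is a sketch rather than the actual source, which ships with azbackup but isn’t reproduced here.

import threading

# A minimal sketch of a worker thread like task.ThreadTask; the real
# implementation ships with azbackup but isn't shown in this chapter.
class ThreadTask(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self.queue = queue

    def run(self):
        while True:
            # Each work item is [function, [arguments]]; call the function
            # and then mark the item done so queue.join() can unblock.
            func, args = self.queue.get()
            try:
                func(*args)
            finally:
                self.queue.task_done()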
Uploading blocks in parallel not only provides better
performance, but also gives you the flexibility to extend this code later
to support retries on error, more sophisticated back-off strategies, and
several other features.
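As a rough illustration of the retry direction, block_uploader could be wrapped in a helper like the one below. azbackup doesn’t do this yet, and the exponential back-off shown is just one possible strategy.

import time

# Illustrative sketch: retry a block upload a few times with simple
# exponential back-off before giving up. Not part of azbackup today.
def put_block_with_retry(connection, container, filename, block_id,
                         block_data, attempts=3):
    for attempt in range(attempts):
        resp = connection.put_block(container, filename, block_id, block_data)
        if resp.status >= 200 and resp.status < 400:
            return resp
        # Back off before retrying: 1s, 2s, 4s, ...
        time.sleep(2 ** attempt)
    raise IOError("Block %s failed after %d attempts" % (block_id, attempts))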
5. Usage
All of this work wouldn’t be much fun if it weren’t useful and easy
to use, would it? Using azbackup is actually quite simple. It has a few
basic options (parsed using code not shown here), and the workflow is
fairly straightforward.
From start to finish, here are all the steps you take to back up
data to the cloud and restore encrypted backups:
1. Set the environment variables AZURE_STORAGE_ACCOUNT and AZURE_STORAGE_KEY to your Windows Azure
storage account name and key, respectively. For example, if your blob
storage account is at foo.blob.core.windows.net,
your AZURE_STORAGE_ACCOUNT should
be set to foo. The tool
automatically looks for these environment variables to connect to blob
storage.
2. Run python azbackup-gen-key
-k keyfilepath, where
keyfilepath is the path and filename where
your RSA key pairs will be stored.
Warning: Do not lose this file. If you do, you will
lose access to data backed up with this tool, and there’s no way to get
the data back.
3. To create a new backup, run python
azbackup.py -c -k keyfilepath
-f
archive_name
directory_to_be_backed_up, where
keyfilepath is the key file from the
previous step, archive_name is the name of the
archive that the tool will generate, and
directory_to_be_backed_up is the path of
the directory you want backed up. Depending on the size of the
directory, this might take some time, because the tool isn’t really
optimized for speed at the moment. If no errors are shown, the tool
will exit silently when the upload is finished.
4. To extract an existing backup, run python azbackup.py -x -k
keyfilepath -f archive_name.
This will extract the contents of the backup to the current
directory.
All tools take an -h
parameter to show usage information.